28 research outputs found

    A Unified Optimization Approach for Sparse Tensor Operations on GPUs

    Sparse tensors appear in many large-scale applications with multidimensional and sparse data. While multidimensional sparse data often need to be processed on manycore processors, attempts to develop highly-optimized GPU-based implementations of sparse tensor operations are rare. The irregular computation patterns and sparsity structures as well as the large memory footprints of sparse tensor operations make such implementations challenging. We leverage the fact that sparse tensor operations share similar computation patterns to propose a unified tensor representation called F-COO. Combined with GPU-specific optimizations, F-COO provides highly-optimized implementations of sparse tensor computations on GPUs. The performance of the proposed unified approach is demonstrated for tensor-based kernels such as the Sparse Matricized Tensor-Times-Khatri-Rao Product (SpMTTKRP) and the Sparse Tensor-Times-Matrix Multiply (SpTTM), and the approach is used in tensor decomposition algorithms. Compared to state-of-the-art work, we improve the performance of SpTTM and SpMTTKRP by up to 3.7 and 30.6 times respectively on NVIDIA Titan-X GPUs. We implement a CANDECOMP/PARAFAC (CP) decomposition and achieve up to 14.9 times speedup using the unified method over state-of-the-art libraries on NVIDIA Titan-X GPUs.
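To make the SpMTTKRP kernel named in this abstract concrete, here is a minimal sequential sketch for a third-order tensor stored in plain coordinate (COO) format. It is not the F-COO layout or the GPU implementation from the paper; the function name `mttkrp_mode0_coo` and its argument layout are assumptions made for illustration only.

```python
def mttkrp_mode0_coo(coords, values, B, C, num_rows):
    """Mode-0 MTTKRP for a 3rd-order sparse tensor in plain COO form.

    coords  : list of (i, j, k) index triples, one per nonzero
    values  : list of nonzero values, same length as coords
    B, C    : dense factor matrices (lists of rows) with the same rank
    Returns M where M[i][r] = sum over nonzeros of v * B[j][r] * C[k][r].
    Hypothetical helper: plain COO, not the paper's F-COO format.
    """
    rank = len(B[0])
    M = [[0.0] * rank for _ in range(num_rows)]
    for (i, j, k), v in zip(coords, values):
        for r in range(rank):
            # Each nonzero contributes to exactly one output row.
            M[i][r] += v * B[j][r] * C[k][r]
    return M


# Small made-up example: a 2 x 2 x 2 tensor with three nonzeros.
coords = [(0, 0, 0), (1, 1, 0), (0, 1, 1)]
values = [1.0, 2.0, 3.0]
B = [[1.0, 2.0], [3.0, 4.0]]
C = [[5.0, 6.0], [7.0, 8.0]]
M = mttkrp_mode0_coo(coords, values, B, C, num_rows=2)
```

The loop shows why the access pattern is irregular: which rows of `B`, `C`, and `M` are touched depends entirely on the nonzero coordinates.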

    Sympiler: Transforming Sparse Matrix Codes by Decoupling Symbolic Analysis

    Sympiler is a domain-specific code generator that optimizes sparse matrix computations by decoupling the symbolic analysis phase from the numerical manipulation stage in sparse codes. The computation patterns in sparse numerical methods are guided by the input sparsity structure and the sparse algorithm itself. In many real-world simulations, the sparsity pattern changes little or not at all. Sympiler takes advantage of these properties to symbolically analyze sparse codes at compile time and to apply inspector-guided transformations that enable low-level transformations of sparse codes. As a result, the Sympiler-generated code outperforms highly-optimized matrix factorization codes from commonly-used specialized libraries, obtaining average speedups over Eigen and CHOLMOD of 3.8X and 1.5X respectively.
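The idea of separating symbolic analysis from numeric work can be illustrated with a sparse lower-triangular solve Lx = b. The sketch below is a hand-written illustration of the general idea, not Sympiler's generated code or API. It assumes L is stored in compressed sparse column (CSC) form with the diagonal entry first in each column; the names `symbolic_reach` and `numeric_solve` are hypothetical.

```python
def symbolic_reach(col_ptr, row_idx, b_nonzeros):
    """Symbolic phase: which columns of L can affect the solution?

    Depends only on the sparsity patterns of L and b, so the result can be
    computed once and reused while the patterns stay the same.
    """
    seen = set()
    stack = list(b_nonzeros)
    while stack:
        j = stack.pop()
        if j in seen:
            continue
        seen.add(j)
        for p in range(col_ptr[j], col_ptr[j + 1]):
            i = row_idx[p]
            if i != j and i not in seen:
                stack.append(i)
    # For a lower-triangular L, ascending order respects all dependencies.
    return sorted(seen)


def numeric_solve(col_ptr, row_idx, vals, b, reach):
    """Numeric phase: visit only the columns found by the symbolic phase."""
    x = list(b)
    for j in reach:
        start = col_ptr[j]
        x[j] /= vals[start]  # assumption: diagonal is stored first
        for p in range(start + 1, col_ptr[j + 1]):
            x[row_idx[p]] -= vals[p] * x[j]
    return x


# L = [[2, 0, 0], [0, 3, 0], [4, 0, 5]] in CSC form (made-up example).
col_ptr = [0, 2, 3, 4]
row_idx = [0, 2, 1, 2]
vals = [2.0, 4.0, 3.0, 5.0]
b = [2.0, 0.0, 0.0]

reach = symbolic_reach(col_ptr, row_idx, [0])  # pattern of b: index 0 only
x = numeric_solve(col_ptr, row_idx, vals, b, reach)
```

Here column 1 is never visited because nothing in `b` can reach it; with a fixed sparsity pattern, `reach` could be computed ahead of time and the numeric loop specialized around it.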

    A Framework for Fine-Grained Synchronization of Dependent GPU Kernels

    Machine Learning (ML) models contain highly parallel computations such as matrix multiplication, convolutions, and dropout. These computations are commonly executed on Graphics Processing Units (GPUs) by dividing the computation into independent processing blocks known as tiles. Since the number of tiles is usually higher than the number of execution units of a GPU, tiles are executed on all execution units in waves. However, the tiles executed in the last wave can under-utilize the execution units because the number of tiles is not always a multiple of the number of execution units. This under-utilization can be reduced by executing multiple independent kernels concurrently on a GPU, but this is not currently possible for dependent kernels. In this paper, we present cuSync, a framework to write custom fine-grained synchronization policies for dependent kernels to improve GPU utilization. cuSync synchronizes tiles instead of kernels, which allows executing tiles of multiple dependent kernels concurrently. Using cuSync, we expressed several synchronization policies in a few lines of code and reduced the inference times of GPT-3 and ResNet-38 by up to 1.19x and 1.16x respectively.
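The last-wave under-utilization described in this abstract is simple arithmetic, sketched below. The model assumes one tile per execution unit at a time and equal tile run times; the function names and the tile and unit counts are made up for illustration and do not come from the paper.

```python
import math


def num_waves(num_tiles, num_units):
    """Number of waves needed to run all tiles, one tile per unit per wave."""
    return math.ceil(num_tiles / num_units)


def last_wave_utilization(num_tiles, num_units):
    """Fraction of execution units that are busy during the final wave."""
    remainder = num_tiles % num_units
    return 1.0 if remainder == 0 else remainder / num_units


def overall_utilization(num_tiles, num_units):
    """Busy unit-slots divided by total unit-slots across all waves."""
    return num_tiles / (num_waves(num_tiles, num_units) * num_units)


# Hypothetical kernel: 192 tiles on a GPU with 80 execution units.
waves = num_waves(192, 80)                  # 3 waves: 80 + 80 + 32 tiles
last = last_wave_utilization(192, 80)       # 32 of 80 units busy
overall = overall_utilization(192, 80)      # 192 of 240 unit-slots used
```

Under these assumptions the final wave leaves 48 of 80 units idle, which is the slack that tiles of a dependent kernel could fill if synchronization happened per tile rather than per kernel.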

    Characterizing and Enhancing SMT Clustered Architectures

    Bibliography: p. 105-11

    Krylov subspace techniques on graphic processing units

    Computations related to many scientific and engineering problems spend most of their time solving large, sparse linear systems. Improving the performance of these solvers on modern parallel architectures enables scientists to simulate large, accurate models and manipulate massive amounts of data in reasonable time frames. Krylov subspace methods (KSMs) are iterative techniques used to solve large sparse systems. The main time-consuming kernels in KSMs are sparse matrix-vector multiplication (SpMV), vector operations (dot products and vector sums), and preconditioner manipulation. This work presents techniques and algorithms to accelerate some of these kernels on a recent generation of parallel architectures called manycore processors. The performance of the proposed optimizations is tested on graphics processing units (GPUs) and compared to previous work. The SpMV kernel is accelerated on GPUs, with speedups of up to 3.3 times over previous GPU implementations of the algorithm. The conjugate gradient iterative solver is accelerated on NVIDIA graphics cards, achieving a 12.9-fold speedup over an optimized implementation of the kernel on multicore CPUs. The sparse approximate inverse preconditioner is accelerated on GPUs and used to enhance the convergence rate of the BiCGStab iterative solver. The preconditioner is generated on an NVIDIA GTX480 in the same time it takes 16 AMD Opteron 252 processors to generate the same preconditioner.

    Communicating data between levels of a memory hierarchy and between processors is time consuming and costly in KSMs. Communication-avoiding (CA) Krylov solvers take k steps of a KSM for the same communication cost as one step, reducing the communication overhead of standard KSMs. The matrix powers kernel in communication-avoiding Krylov solvers is accelerated on NVIDIA GPUs, and speedups of up to 5.7 times are achieved for the tested problems compared to the standard implementation of k SpMV kernels.
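The kernels this abstract lists (SpMV, dot products, vector sums) are exactly what one iteration of the conjugate gradient method is built from. The sketch below is a plain sequential, unpreconditioned version using the compressed sparse row (CSR) format, assuming a symmetric positive definite matrix. It is meant only to show where each kernel appears, not to reproduce the GPU implementations from the thesis; all function names are illustrative.

```python
def spmv_csr(indptr, indices, data, x):
    """Sparse matrix-vector product y = A @ x with A in CSR form."""
    y = [0.0] * (len(indptr) - 1)
    for row in range(len(y)):
        acc = 0.0
        for k in range(indptr[row], indptr[row + 1]):
            acc += data[k] * x[indices[k]]
        y[row] = acc
    return y


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def conjugate_gradient(indptr, indices, data, b, tol=1e-10, max_iter=1000):
    """Unpreconditioned CG for a symmetric positive definite CSR matrix."""
    x = [0.0] * len(b)
    r = list(b)            # residual b - A @ x with x = 0
    p = list(r)            # initial search direction
    rs_old = dot(r, r)
    for _ in range(max_iter):
        if rs_old ** 0.5 < tol:
            break
        Ap = spmv_csr(indptr, indices, data, p)         # SpMV kernel
        alpha = rs_old / dot(p, Ap)                     # dot product
        x = [xi + alpha * pi for xi, pi in zip(x, p)]   # vector updates
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs_old) * pi for ri, pi in zip(r, p)]
        rs_old = rs_new
    return x


# A = [[4, 1], [1, 3]], b = [1, 2]; the exact solution is [1/11, 7/11].
indptr = [0, 2, 4]
indices = [0, 1, 0, 1]
data = [4.0, 1.0, 1.0, 3.0]
x = conjugate_gradient(indptr, indices, data, [1.0, 2.0])
```

Each iteration performs one SpMV and two dot products. A communication-avoiding variant restructures the method so that k such SpMVs are computed together by a matrix powers kernel.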